
Implement double config for AdaptiveCrawler #1683

Closed
Vaccarini-Lorenzo wants to merge 2 commits into unclecode:develop from Vaccarini-Lorenzo:main

Conversation

@Vaccarini-Lorenzo

Summary

Proposed solution to fix Issue #1682

List of files changed and why

File impacted: adaptive_crawler.py

AdaptiveConfig now supports two configs, one for embeddings and one for chat completion API

    config = AdaptiveConfig(
        strategy=strategy,
        max_pages=20,
        top_k_links=3,
        min_gain_threshold=0.05,
        embedding_llm_config=LLMConfig(
            provider='azure/text-embedding-3-small'
        ),
        # For query generation - use LLM models
        query_llm_config=LLMConfig(
            provider='azure/gpt-4.1'
        )
    )

The configs support provider, base_url and api_token.
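For instance, a fully specified pair of configs might look like the following (the endpoint and key values are placeholders for illustration, not values from this PR):

```python
from crawl4ai import AdaptiveConfig, LLMConfig

config = AdaptiveConfig(
    strategy='embedding',
    # Placeholder endpoint/key values -- substitute your own deployment details.
    embedding_llm_config=LLMConfig(
        provider='azure/text-embedding-3-small',
        base_url='https://<your-resource>.openai.azure.com',
        api_token='<your-azure-api-key>',
    ),
    query_llm_config=LLMConfig(
        provider='azure/gpt-4.1',
        base_url='https://<your-resource>.openai.azure.com',
        api_token='<your-azure-api-key>',
    ),
)
```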

How Has This Been Tested?

This is not a breaking change; it just adds an additional config option.
In relation to Issue #1682, the new proposed approach would be:

"""
Comparison: Embedding vs Statistical Strategy

This example demonstrates the differences between statistical and embedding
strategies for adaptive crawling, showing when to use each approach.
"""

import asyncio
import time
import os
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig, AsyncLogger, LLMConfig
from crawl4ai.async_logger import LogLevel
import litellm

litellm._turn_on_debug()

logger = AsyncLogger(verbose=False, log_level=LogLevel.ERROR)

async def crawl_with_strategy(url: str, query: str, strategy: str):

    """Helper function to crawl with a specific strategy"""

    config = AdaptiveConfig(
        strategy=strategy,
        max_pages=20,
        top_k_links=3,
        min_gain_threshold=0.05,
        embedding_llm_config=LLMConfig(
            provider='azure/text-embedding-3-small',
            api_token='',
        ),
        # For query generation - use LLM models
        query_llm_config=LLMConfig(
            provider='azure/gpt-4.1',
            api_token='',
        )
    )
    
    async with AsyncWebCrawler(verbose=False, logger=logger) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        
        start_time = time.time()
        result = await adaptive.digest(start_url=url, query=query)
        elapsed = time.time() - start_time
        
        return {
            'result': result,
            'crawler': adaptive,
            'elapsed': elapsed,
            'pages': len(result.crawled_urls),
            'confidence': adaptive.confidence
        }


async def main():
    """Compare embedding and statistical strategies"""
    
    # Test scenarios
    test_cases = [
        {
            'name': 'Technical Documentation (Specific Terms)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'asyncio.create_task event_loop.run_until_complete'
        },
        {
            'name': 'Conceptual Query (Semantic Understanding)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'concurrent programming patterns'
        },
        {
            'name': 'Ambiguous Query',
            'url': 'https://realpython.com',
            'query': 'python performance optimization'
        }
    ]

    
    for test in test_cases:
        print("\n" + "="*70)
        print(f"TEST: {test['name']}")
        print(f"URL: {test['url']}")
        print(f"Query: '{test['query']}'")
        print("="*70)
        
        # Run statistical strategy (needed below: the comparison section uses stat_result)
        print("\n📊 Statistical Strategy:")
        stat_result = await crawl_with_strategy(
            test['url'], 
            test['query'], 
            'statistical'
        )
        
        print(f"  Pages crawled: {stat_result['pages']}")
        print(f"  Time taken: {stat_result['elapsed']:.2f}s")
        print(f"  Confidence: {stat_result['confidence']:.1%}")
        print(f"  Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")
        
        # Show term coverage
        if hasattr(stat_result['result'], 'term_frequencies'):
            query_terms = test['query'].lower().split()
            covered = sum(1 for term in query_terms 
                         if term in stat_result['result'].term_frequencies)
            print(f"  Term coverage: {covered}/{len(query_terms)} query terms found")
        
        # Run embedding strategy
        print("\n🧠 Embedding Strategy:")
        emb_result = await crawl_with_strategy(
            test['url'], 
            test['query'], 
            'embedding'
        )
        
        print(f"  Pages crawled: {emb_result['pages']}")
        print(f"  Time taken: {emb_result['elapsed']:.2f}s")
        print(f"  Confidence: {emb_result['confidence']:.1%}")
        print(f"  Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")
        
        # Show semantic understanding
        if emb_result['result'].expanded_queries:
            print(f"  Query variations: {len(emb_result['result'].expanded_queries)}")
            print(f"  Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
        
        # Compare results
        print("\n📈 Comparison:")
        efficiency_diff = ((stat_result['pages'] - emb_result['pages']) / 
                          stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0
        
        print(f"  Efficiency: ", end="")
        if efficiency_diff > 0:
            print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
        else:
            print(f"Statistical used {-efficiency_diff:.0f}% fewer pages")
        
        print(f"  Speed: ", end="")
        if stat_result['elapsed'] < emb_result['elapsed']:
            print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
        else:
            print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")
        
        print(f"  Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
        
        # Recommendation
        print("\n💡 Recommendation:")
        if 'specific' in test['name'].lower() or all(len(term) > 5 for term in test['query'].split()):
            print("  → Statistical strategy is likely better for this use case (specific terms)")
        elif 'conceptual' in test['name'].lower() or 'semantic' in test['name'].lower():
            print("  → Embedding strategy is likely better for this use case (semantic understanding)")
        else:
            if emb_result['confidence'] > stat_result['confidence'] + 0.1:
                print("  → Embedding strategy achieved significantly better understanding")
            elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
                print("  → Statistical strategy is much faster with similar results")
            else:
                print("  → Both strategies performed similarly; choose based on your priorities")
    
    # Summary recommendations
    print("\n" + "="*70)
    print("STRATEGY SELECTION GUIDE")
    print("="*70)
    print("\n✅ Use STATISTICAL strategy when:")
    print("  - Queries contain specific technical terms")
    print("  - Speed is critical")
    print("  - No API access available")
    print("  - Working with well-structured documentation")
    
    print("\n✅ Use EMBEDDING strategy when:")
    print("  - Queries are conceptual or ambiguous")
    print("  - Semantic understanding is important")
    print("  - Need to detect irrelevant content")
    print("  - Working with diverse content sources")


if __name__ == "__main__":
    asyncio.run(main())

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ntohidi
Collaborator

ntohidi commented Feb 25, 2026

@Vaccarini-Lorenzo
Thanks for the PR, separating embedding_llm_config and query_llm_config is the right approach to fix this issue. However, there are a few issues that need to be addressed before we can merge:

Runtime bugs:

  1. _embedding_llm_config_dict was renamed to _llm_config_dict on AdaptiveConfig, but _get_embedding_llm_config_dict() in EmbeddingStrategy still references self.config._embedding_llm_config_dict (the old name). This will raise AttributeError.
  2. _get_query_llm_config_dict() references self.config._query_llm_config_dict, but this property was never added to AdaptiveConfig. Same AttributeError.

Behavioral change:

  1. The old _get_embedding_llm_config_dict() returned None when no config was set, which made get_text_embeddings() use local sentence-transformers (no API key needed). The new fallback defaults to openai/text-embedding-3-small, which would break existing users who don't have an OpenAI API key. The fallback should remain None to preserve the local embedding behavior.

Minor:

  1. Line 182: query_llm_config comment says "Separate config for embeddings" — should say "for query generation".
  2. Lines 618-622: the leftover commented-out code block should be removed.

Could you take a look at these? The main fixes needed are adding the missing _query_llm_config_dict property on AdaptiveConfig, fixing the renamed property reference, and preserving the None fallback for local embeddings.
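As a rough sketch of those fixes, using stand-in classes rather than the real crawl4ai ones (the `to_dict()` shape here is assumed for illustration; the real `LLMConfig.to_dict()` also carries backoff fields):

```python
from typing import Optional

class LLMConfig:
    """Stand-in for crawl4ai's LLMConfig (illustration only)."""
    def __init__(self, provider: str, base_url: Optional[str] = None,
                 api_token: Optional[str] = None):
        self.provider = provider
        self.base_url = base_url
        self.api_token = api_token

    def to_dict(self) -> dict:
        # Assumed shape for this sketch.
        return {'provider': self.provider,
                'base_url': self.base_url,
                'api_token': self.api_token}

class AdaptiveConfig:
    """Stand-in showing the two properties the review asks for."""
    def __init__(self, embedding_llm_config: Optional[LLMConfig] = None,
                 query_llm_config: Optional[LLMConfig] = None):
        self.embedding_llm_config = embedding_llm_config
        self.query_llm_config = query_llm_config

    @property
    def _embedding_llm_config_dict(self) -> Optional[dict]:
        # Fallback stays None so local sentence-transformers keep working
        # when no embedding config is supplied.
        if self.embedding_llm_config is None:
            return None
        return self.embedding_llm_config.to_dict()

    @property
    def _query_llm_config_dict(self) -> Optional[dict]:
        # The property the review found missing, mirroring the embedding one.
        if self.query_llm_config is None:
            return None
        return self.query_llm_config.to_dict()
```

With no configs set, both properties return None, which is what preserves the local-embedding behavior flagged in the review.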

unclecode added a commit that referenced this pull request Feb 25, 2026
…ion (#1682)

The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
@unclecode
Owner

Hey @Vaccarini-Lorenzo - thank you for filing #1682 and this PR! You identified a real design gap and the query_llm_config approach you proposed was exactly the right solution.

We went ahead and landed this in develop (a4cc0a9) with a clean-room implementation that addresses the issues @ntohidi flagged (missing properties, broken fallback chain, behavioral change on local embeddings). Specifically:

  • Added query_llm_config field on AdaptiveConfig + _query_llm_config_dict property
  • Added _get_query_llm_config_dict() on EmbeddingStrategy with a proper fallback chain: explicit query_llm_config -> AdaptiveConfig -> legacy llm_config -> None (preserves local embedding behavior)
  • Simplified _embedding_llm_config_dict to use LLMConfig.to_dict() (fixes missing backoff params)
  • Fixed base_url and backoff params not being passed to perform_completion_with_backoff in query expansion
  • Added e2e tests and updated docs
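The fallback chain for query generation could be sketched like this (a simplified stand-alone version; the real logic lives in `_get_query_llm_config_dict()` on EmbeddingStrategy):

```python
from types import SimpleNamespace

def get_query_llm_config_dict(config):
    """Sketch of the fallback chain:
    explicit query_llm_config -> legacy llm_config -> None."""
    query_cfg = getattr(config, 'query_llm_config', None)
    if query_cfg is not None:
        return dict(query_cfg)
    legacy_cfg = getattr(config, 'llm_config', None)
    if legacy_cfg is not None:
        return dict(legacy_cfg)
    # None lets the strategy keep its hardcoded defaults / local embeddings.
    return None

# Explicit query config wins over the legacy one.
cfg = SimpleNamespace(query_llm_config={'provider': 'azure/gpt-4.1'},
                      llm_config={'provider': 'openai/gpt-4o-mini'})
print(get_query_llm_config_dict(cfg))  # {'provider': 'azure/gpt-4.1'}

# Nothing set: fall through to None.
print(get_query_llm_config_dict(SimpleNamespace()))  # None
```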

The target API is exactly what you proposed:

AdaptiveConfig(
    embedding_llm_config=LLMConfig(provider='openai/text-embedding-3-small'),
    query_llm_config=LLMConfig(provider='openai/gpt-4o-mini'),
)

Since the fix is already on develop, we'll close this PR - but your contribution was instrumental in getting this done. The commit credits you for the original idea. We'd love to see more contributions from you!

Closing in favor of a4cc0a9. Also closing #1682 as fixed.

@unclecode unclecode closed this Feb 25, 2026
@Vaccarini-Lorenzo
Author

Hi @unclecode
Thank you so much for making my contribution possible, this project is just amazing!

P.s.
I think that you made a typo in commit a4cc0a9

Inspired by PR https://github.com/unclecode/crawl4ai/pull/1683 from ~~@sthakrar~~ @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.

@unclecode
Owner

@Vaccarini-Lorenzo Good catch on the typo - that's embarrassing! I'll fix the commit message to properly credit you instead of @sthakrar. Apologies for the mixup.

And thanks again for the contribution - the separate query_llm_config / embedding_llm_config design you proposed was clean and exactly what we needed. Glad to have it in the codebase.

By the way - we're building out Crawl4AI Cloud and starting paid collaborations with contributors who know the system well. If that's something you'd be interested in, send an email to aravind@crawl4ai.com (cc: unclecode@crawl4ai.com) and we can chat.

unclecode added a commit that referenced this pull request Feb 27, 2026
…ion (#1682)

The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.
@unclecode
Owner

@Vaccarini-Lorenzo Fixed! The commit message now correctly credits you. Thanks for flagging it.
